Example jupyter_spark notebook

This is an example notebook to demonstrate the jupyter_spark notebook plugin.

It is based on the approximating-pi example in the PySpark documentation. It works by sampling random points in a square and counting how many fall inside the unit circle.


In [1]:
from random import random
from operator import add

from pyspark.sql import SparkSession

Create a SparkSession and give it a name.

Note: This will start the Spark driver for the notebook -- there is no need to run spark-shell separately.


In [2]:
spark = SparkSession \
            .builder \
            .appName("PythonPi") \
            .getOrCreate()

partitions is the number of partitions to split the work into; Spark runs one task per partition.


In [3]:
partitions = 2
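To make the partitioning concrete, here is a local sketch (no Spark required) of how parallelize() would split the sample range into near-equal contiguous chunks, one task per chunk. The chunk helper is hypothetical, written only for illustration; Spark's actual slicing logic may divide remainders slightly differently.

```python
# Sketch (no Spark needed): splitting a sequence into `partitions`
# contiguous, near-equal chunks, one task per chunk.
def chunk(seq, partitions):
    """Split seq into `partitions` contiguous, near-equal slices."""
    size, extra = divmod(len(seq), partitions)
    chunks, start = [], 0
    for i in range(partitions):
        end = start + size + (1 if i < extra else 0)
        chunks.append(seq[start:end])
        start = end
    return chunks

print(chunk(list(range(1, 11)), 2))  # two chunks of five elements each
```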

n is the total number of random samples to draw.


In [4]:
n = 100000000

This is the sampling function. It ignores its argument, generates a random point in the square from (-1, -1) to (1, 1), and returns 1 if the point falls inside the unit circle, 0 otherwise.


In [5]:
def f(_):
    x = random() * 2 - 1
    y = random() * 2 - 1
    return 1 if x ** 2 + y ** 2 <= 1 else 0
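The estimator works because the unit circle has area pi while the enclosing square has area 4, so the hit rate should approach pi / 4 ≈ 0.785. A quick seeded sanity check of the same sampling logic, runnable without Spark (the seed and trial count here are arbitrary choices for reproducibility, not part of the original example):

```python
import random as _random

_random.seed(42)  # fixed seed so the check is repeatable

def sample(_):
    x = _random.random() * 2 - 1
    y = _random.random() * 2 - 1
    return 1 if x ** 2 + y ** 2 <= 1 else 0

trials = 10_000
hit_rate = sum(sample(i) for i in range(trials)) / trials
print(hit_rate)  # should land near pi / 4
```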

Here's where we farm the work out to Spark.


In [6]:
count = spark.sparkContext \
    .parallelize(range(1, n + 1), partitions) \
    .map(f) \
    .reduce(add)
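The pipeline above is the distributed equivalent of summing f over the range locally. A minimal single-machine sketch using functools.reduce with the same operator.add, shown here at a much smaller sample size so it runs quickly (small_n is an illustrative choice, not from the original notebook):

```python
from functools import reduce
from operator import add
from random import random, seed

seed(0)  # fixed seed so the sketch is repeatable

def f(_):
    x = random() * 2 - 1
    y = random() * 2 - 1
    return 1 if x ** 2 + y ** 2 <= 1 else 0

# Same map -> reduce(add) shape as the Spark pipeline, just local.
small_n = 10_000
count = reduce(add, map(f, range(1, small_n + 1)))
print("Pi is roughly %f" % (4.0 * count / small_n))
```

Spark performs exactly this map-then-reduce, but with each partition mapped and partially reduced on a separate task before the partial sums are combined.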

In [7]:
print("Pi is roughly %f" % (4.0 * count / n))


Pi is roughly 3.141880
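The result is accurate to about three decimal places, which is what Monte Carlo sampling predicts: each sample is a Bernoulli trial with success probability pi / 4, so the standard error of the estimate shrinks as 1 / sqrt(n). A short check of the expected error at n = 100000000:

```python
import math

n = 100000000
p = math.pi / 4  # true probability a point lands in the circle
# The estimate 4 * count / n has one standard error of:
se = 4 * math.sqrt(p * (1 - p) / n)
print("one standard error: %g" % se)
```

The observed error (|3.141880 - pi| ≈ 0.00029) is within a couple of standard errors, as expected.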

Shut down the SparkSession.


In [8]:
spark.stop()